A race condition or race hazard is the condition of an electronics, software, or other system where the system's substantive behavior is Sequential logic on the sequence or timing of other uncontrollable events, leading to unexpected or inconsistent results. It becomes a software bug when one or more of the possible behaviors is undesirable.
The term race condition was already in use by 1954, for example in David A. Huffman's doctoral thesis "The synthesis of sequential switching circuits".Huffman, David A. " The synthesis of sequential switching circuits." (1954).
Race conditions can occur especially in or multithreaded or distributed software programs. Using mutual exclusion can prevent race conditions in distributed software systems.
Consider, for example, a two-input AND gate fed with the following logic: A logic signal on one input and its negation, (the ¬ is a Negation), on another input in theory never output a true value: . If, however, changes in the value of take longer to propagate to the second input than the first when changes from false to true then a brief period will ensue during which both inputs are true, and so the gate's output will also be true.
A practical example of a race condition can occur when logic circuitry is used to detect certain outputs of a counter. If all the bits of the counter do not change exactly simultaneously, there will be intermediate patterns that can trigger false matches.
A non-critical race condition occurs when the order in which internal variables are changed does not determine the eventual state that the state machine will end up in.
A dynamic race condition occurs when it results in multiple transitions when only one is intended. They are due to interaction between gates. It can be eliminated by using no more than two levels of gating.
An essential race condition occurs when an input has two transitions in less than the total feedback propagation time. Sometimes they are cured using inductive delay line elements to effectively increase the time duration of an input signal.
As well as these problems, some logic elements can enter metastable states, which create further problems for circuit designers.
Critical race conditions cause invalid execution and . Critical race conditions often happen when the processes or threads depend on some shared state. Operations upon shared states are done in that must be mutual exclusion. Failure to obey this rule can corrupt the shared state.
A data race is a type of race condition. Data races are important parts of various formal memory models. The memory model defined in the C11 and C++11 standards specify that a C or C++ program containing a data race has undefined behavior.
A race condition can be difficult to reproduce and debug because the end result is nondeterministic and depends on the relative timing between interfering threads. Problems of this nature can therefore disappear when running in debug mode, adding extra logging, or attaching a debugger. A bug that disappears like this during debugging attempts is often referred to as a "Heisenbug". It is therefore better to avoid race conditions by careful software design.
0 |
0 |
0 |
1 |
1 |
1 |
2 |
In the case shown above, the final value is 2, as expected. However, if the two threads run simultaneously without locking or synchronization (via semaphores), the outcome of the operation could be wrong. The alternative sequence of operations below demonstrates this scenario:
0 |
0 |
0 |
0 |
0 |
1 |
1 |
In this case, the final value is 1 instead of the expected result of 2. This occurs because here the increment operations are not mutually exclusive. Mutually exclusive operations are those that cannot be interrupted while accessing some resource such as a memory location.
This can be dangerous because on many platforms, if two threads write to a memory location at the same time, it may be possible for the memory location to end up holding a value that is some arbitrary and meaningless combination of the bits representing the values that each thread was attempting to write; this could result in memory corruption if the resulting value is one that neither thread attempted to write (sometimes this is called a 'torn write'). Similarly, if one thread reads from a location while another thread is writing to it, it may be possible for the read to return a value that is some arbitrary and meaningless combination of the bits representing the value that the memory location held before the write, and of the bits representing the value being written.
On many platforms, special memory operations are provided for simultaneous access; in such cases, typically simultaneous access using these special operations is safe, but simultaneous access using other memory operations is dangerous. Sometimes such special operations (which are safe for simultaneous access) are called atomic or synchronization operations, whereas the ordinary operations (which are unsafe for simultaneous access) are called data operations. This is probably why the term is data race; on many platforms, where there is a race condition involving only synchronization operations, such a race may be nondeterministic but otherwise safe; but a data race could lead to memory corruption or undefined behavior.
The C++ standard, in draft N4296 (2014-11-19), defines data race as follows in section 1.10.23 (page 14)
Two actions are potentially concurrent if
- they are performed by different threads, or
- they are unsequenced, and at least one is performed by a signal handler.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below omitted. Any such data race results in undefined behavior.
The parts of this definition relating to signal handlers are idiosyncratic to C++ and are not typical of definitions of data race.
The paper Detecting Data Races on Weak Memory SystemsAdve, Sarita & Hill, Mark & Miller, Barton & H. B. Netzer, Robert. (1991). Detecting Data Races on Weak Memory Systems. ACM SIGARCH Computer Architecture News. 19. 234–243. 10.1109/ISCA.1991.1021616. provides a different definition:
"two memory operations conflict if they access the same location and at least one of them is a write operation ... "Two memory operations, x and y, in a sequentially consistent execution form a race 〈x,y〉, iff x and y conflict, and they are not ordered by the hb1 relation of the execution. The race 〈x,y〉, is a data race iff at least one of x or y is a data operation.
Here we have two memory operations accessing the same location, one of which is a write.
The hb1 relation is defined elsewhere in the paper, and is an example of a typical "happens-before" relation; intuitively, if we can prove that we are in a situation where one memory operation X is guaranteed to be executed to completion before another memory operation Y begins, then we say that "X happens-before Y". If neither "X happens-before Y" nor "Y happens-before X", then we say that X and Y are "not ordered by the hb1 relation". So, the clause "... and they are not ordered by the hb1 relation of the execution" can be intuitively translated as "... and X and Y are potentially concurrent".
The paper considers dangerous only those situations in which at least one of the memory operations is a "data operation"; in other parts of this paper, the paper also defines a class of "synchronization operations" which are safe for potentially simultaneous use, in contrast to "data operations".
The Java Language Specification provides a different definition:
Two accesses to (reads of or writes to) the same variable are said to be conflicting if at least one of the accesses is a write ... When a program contains two conflicting accesses (§17.4.1) that are not ordered by a happens-before relationship, it is said to contain a data race ... a data race cannot cause incorrect behavior such as returning the wrong length for an array.
A critical difference between the C++ approach and the Java approach is that in C++, a data race is undefined behavior, whereas in Java, a data race merely affects "inter-thread actions". This means that in C++, an attempt to execute a program containing a data race could (while still adhering to the spec) crash or could exhibit insecure or bizarre behavior, whereas in Java, an attempt to execute a program containing a data race may produce undesired concurrency behavior but is otherwise (assuming that the implementation adheres to the spec) safe.
For example, in Java, this guarantee is directly specified:
A program is correctly synchronized if and only if all sequentially consistent executions are free of data races.If a program is correctly synchronized, then all executions of the program will appear to be sequentially consistent (§17.4.3).
This is an extremely strong guarantee for programmers. Programmers do not need to reason about reorderings to determine that their code contains data races. Therefore they do not need to reason about reorderings when determining whether their code is correctly synchronized. Once the determination that the code is correctly synchronized is made, the programmer does not need to worry that reorderings will affect his or her code.
A program must be correctly synchronized to avoid the kinds of counterintuitive behaviors that can be observed when code is reordered. The use of correct synchronization does not ensure that the overall behavior of a program is correct. However, its use does allow a programmer to reason about the possible behaviors of a program in a simple way; the behavior of a correctly synchronized program is much less dependent on possible reorderings. Without correct synchronization, very strange, confusing and counterintuitive behaviors are possible.
By contrast, a draft C++ specification does not directly require an SC for DRF property, but merely observes that there exists a theorem providing it:
[Note:It can be shown that programs that correctly use mutexes and memory_order_seq_cst operations to prevent all data races and use no other synchronization operations behave as if the operations executed by their constituent threads were simply interleaved, with each value computation of an object being taken from the last side effect on that object in that interleaving. This is normally referred to as “sequential consistency”. However, this applies only to data-race-free programs, and data-race-free programs cannot observe most program transformations that do not change single-threaded program semantics. In fact, most single-threaded program transformations continue to be allowed, since any program that behaves differently as a result must perform an undefined operation.— end note
Note that the C++ draft specification admits the possibility of programs that are valid but use synchronization operations with a memory_order other than memory_order_seq_cst, in which case the result may be a program which is correct but for which no guarantee of sequentially consistency is provided. In other words, in C++, some correct programs are not sequentially consistent. This approach is thought to give C++ programmers the freedom to choose faster program execution at the cost of giving up ease of reasoning about their program.
There are various theorems, often provided in the form of memory models, that provide SC for DRF guarantees given various contexts. The premises of these theorems typically place constraints upon both the memory model (and therefore upon the implementation), and also upon the programmer; that is to say, typically it is the case that there are programs which do not meet the premises of the theorem and which could not be guaranteed to execute in a sequentially consistent manner.
The DRF1 memory model provides SC for DRF and allows the optimizations of the WO (weak ordering), RCsc (Release Consistency with sequentially consistent special operations), VAX memory model, and data-race-free-0 memory models. The PLpc memory modelKourosh Gharachorloo and Sarita V. Adve and Anoop Gupta and John L. Hennessy and Mark D. Hill, Programming for Different Memory Consistency Models, Journal of Parallel and Distributed Computing, 1992, volume 15, pages 399–407. provides SC for DRF and allows the optimizations of the TSO (Total Store Order), PSO, PC (Processor Consistency), and RCpc (Release Consistency with processor consistency special operations) models. DRFrlx provides a sketch of an SC for DRF theorem in the presence of relaxed atomics.
A specific kind of race condition involves checking for a predicate (e.g. for authentication), then acting on the predicate, while the state can change between the time-of-check and the time-of-use. When this kind of computer bug exists in security-sensitive code, a security vulnerability called a time-of-check to time-of-use ( TOCTTOU) bug is created.
Race conditions are also intentionally used to create hardware random number generators and physically unclonable functions.
A different form of race condition exists in file systems where unrelated programs may affect each other by suddenly using up available resources such as disk space, memory space, or processor cycles. Software not carefully designed to anticipate and handle this race situation may then become unpredictable. Such a risk may be overlooked for a long time in a system that seems very reliable. But eventually enough data may accumulate or enough other software may be added to critically destabilize many parts of a system. An example of this occurred with the near loss of the Mars Rover "Spirit" not long after landing, which occurred due to deleted file entries causing the file system library to consume all available memory space. A solution is for software to request and reserve all the resources it will need before beginning a task; if this request fails then the task is postponed, avoiding the many points where failure could have occurred. Alternatively, each of those points can be equipped with error handling, or the success of the entire task can be verified afterwards, before continuing. A more common approach is to simply verify that enough system resources are available before starting a task; however, this may not be adequate because in complex systems the actions of other running programs can be unpredictable.
In this case of a race condition, the concept of the "shared resource" covers the state of the network (what channels exist, as well as what users started them and therefore have what privileges), which each server can freely change as long as it signals the other servers on the network about the changes so that they can update their conception of the state of the network. However, the network latency across the network makes possible the kind of race condition described. In this case, heading off race conditions by imposing a form of control over access to the shared resource—say, appointing one server to control who holds what privileges—would mean turning the distributed network into a centralized one (at least for that one part of the network operation).
Race conditions can also exist when a computer program is written with non-blocking sockets, in which case the performance of the program can be dependent on the speed of the network link.
Another example is the energy management system provided by GE Energy and used by Ohio-based FirstEnergy Corp (among other power facilities). A race condition existed in the alarm subsystem; when three sagging power lines were tripped simultaneously, the condition prevented alerts from being raised to the monitoring technicians, delaying their awareness of the problem. This software flaw eventually led to the North American Blackout of 2003. GE Energy later developed a software patch to correct the previously undiscovered error.
Thread Safety Analysis is a static analysis tool for annotation-based intra-procedural static analysis, originally implemented as a branch of gcc, and now reimplemented in Clang, supporting PThreads.
Dynamic analysis tools include:
There are several benchmarks designed to evaluate the effectiveness of data race detection tools
In UK railway signalling, a race condition would arise in the carrying out of Rule 55. According to this rule, if a train was stopped on a running line by a signal, the locomotive fireman would walk to the signal box in order to remind the signalman that the train was present. In at least one case, at Winwick in 1934, an accident occurred because the signalman accepted another train before the fireman arrived. Modern signalling practice removes the race condition by making it possible for the driver to instantaneously contact the signal box by radio.
Race conditions are not confined to digital systems. Neuroscience is demonstrating that race conditions can occur in mammal brains as well. One demonstrated race is between neural pathways that execute planned movement and separate pathways than can cancel said movement.
|
|